CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It provides a structured approach to planning a data mining project and is a robust, well-proven methodology.
For Lab One we will use the first two portions of this methodology. We will start by focusing on understanding the objectives and requirements from a business perspective, and then use this knowledge to define the data problem and project plan. Next we will identify data quality issues, discover initial insights, and detect interesting nuggets of information that might form a hypothesis for analysis.
In future labs we will execute all of the CRISP-DM steps.
The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective. We will define our objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of our project.
This analysis of Home Credit's Default Risk dataset will focus on generating accurate loan default risk probabilities. Predicting loan defaults is essential to the profitability of banks and, given the competitive nature of the loan market, a bank that collects the right data can offer and service more loans. The target variable of the dataset is the binary label, 'TARGET', indicating whether the loan entered into default status or not.
List the resources available to the project including:
Cost
Benefits
Business success criteria
Data mining success criteria
Use Home Credit current customer data to predict whether a potential client is capable of repaying the requested loan. During this process we will determine the features that are most influential in determining this target variable.
The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate these.
Initial data collection report - List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.
# Import Libraries Required.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
# import custom code
from cleaning import read_clean_data, missing_values_table, load_bureau, create_newFeatures, merge_newFeatures
from tables import count_values_table
# some defaults
pd_max_rows_default = 60
# load data
# path = './application_train.csv'
# note that 'XNA' is an encoding for NA; interpret it as np.nan
df = pd.read_csv('./application_train.csv',
na_values = ['XNA'])
#loading bureau dataset
bureau = pd.read_csv('./bureau.csv',
na_values = ['XNA'])
#bureau = load_bureau()
#newFeatures = create_newFeatures(bureau)
#df = df.merge(newFeatures, on = 'SK_ID_CURR', how = 'left')
Data description report - Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.
We will use two of the files from the total dataset.
application_train.csv: Information provided with each loan application
bureau.csv: Information regarding clients from the credit bureaus
The two data files can be joined on the loan ID (SK_ID_CURR).
There are 122 features and 307511 observations in application_train.csv.
df.shape
The following features are loan application attributes from application_train.csv.
Descriptions, types, and units are given for each feature.
In the original dataset, there were a large number of features describing the building where the client lives that were included with the loan applications.
However, these features could not be used due to the large number of missing values.
| Feature | Description | Type | Units |
|---|---|---|---|
| SK_ID_CURR | ID of loan in our sample | Category | N/A |
| TARGET | Target Variable (1 - difficulty paying loan, 0 - all other cases) | Category | N/A |
| NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | Category | N/A |
| CODE_GENDER | Gender of the client (M - male, F - female) | Category | N/A |
| FLAG_OWN_CAR | Flag if the client owns a car | Category | N/A |
| FLAG_OWN_REALTY | Flag if client owns a house or flat | Category | N/A |
| CNT_CHILDREN | Number of children the client has | Continuous | N/A |
| AMT_INCOME_TOTAL | Income of the client | Continuous | Currency |
| AMT_CREDIT | Credit amount of the loan | Continuous | Currency |
| AMT_ANNUITY | Loan annuity | Continuous | Currency |
| AMT_GOODS_PRICE | For consumer loans it is the price of the goods for which the loan is given | Continuous | Currency |
| NAME_TYPE_SUITE | Who was accompanying client when he was applying for the loan | Category | N/A |
| NAME_INCOME_TYPE | Clients income type (businessman, working, maternity leave) | Category | N/A |
| NAME_EDUCATION_TYPE | Level of highest education the client achieved | Category | N/A |
| NAME_FAMILY_STATUS | Family status of the client | Category | N/A |
| NAME_HOUSING_TYPE | What is the housing situation of the client (renting, living with parents, ...) | Category | N/A |
| REGION_POPULATION_RELATIVE | Normalized population of region where client lives (higher number means the client lives in more populated region) | Continuous | N/A |
| DAYS_BIRTH | Client's age in days at the time of application | Continuous | Days |
| DAYS_EMPLOYED | How many days before the application the person started current employment | Continuous | Days |
| DAYS_REGISTRATION | How many days before the application did client change his registration | Continuous | Days |
| DAYS_ID_PUBLISH | How many days before the application did client change the identity document with which he applied for the loan | Continuous | Days |
| OWN_CAR_AGE | Age of client's car | Continuous | Years |
| FLAG_MOBIL | Did client provide mobile phone (Y, N) | Category | N/A |
| FLAG_EMP_PHONE | Did client provide work phone (Y, N) | Category | N/A |
| FLAG_WORK_PHONE | Did client provide home phone (Y, N) | Category | N/A |
| FLAG_CONT_MOBILE | Was mobile phone reachable (Y, N) | Category | N/A |
| FLAG_PHONE | Did client provide home phone (Y, N) | Category | N/A |
| FLAG_EMAIL | Did client provide email (Y, N) | Category | N/A |
| CNT_FAM_MEMBERS | How many family members does client have | Continuous | N/A |
| OCCUPATION_TYPE | What kind of occupation does the client have | Category | N/A |
| REGION_RATING_CLIENT | Our rating of the region where client lives (1,2,3) | Category | N/A |
| REGION_RATING_CLIENT_W_CITY | Our rating of the region where client lives with taking city into account (1,2,3) | Category | N/A |
| WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for the loan | Category | N/A |
| HOUR_APPR_PROCESS_START | Approximately at what hour did the client apply for the loan | Category | N/A |
| REG_REGION_NOT_LIVE_REGION | Flag if client's permanent address does not match contact address (1=different, 0=same, at region level) | Category | N/A |
| REG_REGION_NOT_WORK_REGION | Flag if client's permanent address does not match work address (1=different, 0=same, at region level) | Category | N/A |
| LIVE_REGION_NOT_WORK_REGION | Flag if client's contact address does not match work address (1=different, 0=same, at region level) | Category | N/A |
| REG_CITY_NOT_LIVE_CITY | Flag if client's permanent address does not match contact address (1=different, 0=same, at city level) | Category | N/A |
| REG_CITY_NOT_WORK_CITY | Flag if client's permanent address does not match work address (1=different, 0=same, at city level) | Category | N/A |
| LIVE_CITY_NOT_WORK_CITY | Flag if client's contact address does not match work address (1=different, 0=same, at city level) | Category | N/A |
| ORGANIZATION_TYPE | Type of organization where client works | Category | N/A |
| EXT_SOURCE_1 | Normalized score from external data source | Continuous | N/A |
| EXT_SOURCE_2 | Normalized score from external data source | Continuous | N/A |
| EXT_SOURCE_3 | Normalized score from external data source | Continuous | N/A |
| OBS_30_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings with observable 30 DPD (days past due) default | Continuous | N/A |
| DEF_30_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings defaulted on 30 DPD (days past due) | Continuous | N/A |
| OBS_60_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings with observable 60 DPD (days past due) default | Continuous | N/A |
| DEF_60_CNT_SOCIAL_CIRCLE | How many observations of client's social surroundings defaulted on 60 DPD (days past due) | Continuous | N/A |
| DAYS_LAST_PHONE_CHANGE | How many days before application did client change phone | Continuous | Days |
| FLAG_DOCUMENT_2 | Did client provide document 2 | Category | N/A |
| FLAG_DOCUMENT_3 | Did client provide document 3 | Category | N/A |
| FLAG_DOCUMENT_4 | Did client provide document 4 | Category | N/A |
| FLAG_DOCUMENT_5 | Did client provide document 5 | Category | N/A |
| FLAG_DOCUMENT_6 | Did client provide document 6 | Category | N/A |
| FLAG_DOCUMENT_7 | Did client provide document 7 | Category | N/A |
| FLAG_DOCUMENT_8 | Did client provide document 8 | Category | N/A |
| FLAG_DOCUMENT_9 | Did client provide document 9 | Category | N/A |
| FLAG_DOCUMENT_10 | Did client provide document 10 | Category | N/A |
| FLAG_DOCUMENT_11 | Did client provide document 11 | Category | N/A |
| FLAG_DOCUMENT_12 | Did client provide document 12 | Category | N/A |
| FLAG_DOCUMENT_13 | Did client provide document 13 | Category | N/A |
| FLAG_DOCUMENT_14 | Did client provide document 14 | Category | N/A |
| FLAG_DOCUMENT_15 | Did client provide document 15 | Category | N/A |
| FLAG_DOCUMENT_16 | Did client provide document 16 | Category | N/A |
| FLAG_DOCUMENT_17 | Did client provide document 17 | Category | N/A |
| FLAG_DOCUMENT_18 | Did client provide document 18 | Category | N/A |
| FLAG_DOCUMENT_19 | Did client provide document 19 | Category | N/A |
| FLAG_DOCUMENT_20 | Did client provide document 20 | Category | N/A |
| FLAG_DOCUMENT_21 | Did client provide document 21 | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the client one hour before application | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the client 3 month before application (excluding one month before application) | Category | N/A |
| AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the client one year before application (excluding last 3 months before application) | Category | N/A |
Features from the bureau data are not used directly; new features are engineered from them. Descriptions of the base features used in the feature engineering are given here. For details on the engineered features see section 2.5 Feature Engineering.
| Feature | Description | Type | Units |
|---|---|---|---|
| SK_ID_CURR | ID of loan in our sample | Category | N/A |
| CREDIT_ACTIVE | The status of reported credits (Open or Closed) | Category | N/A |
| CREDIT_DAY_OVERDUE | The number of days a given credit is over due | Continuous | Days |
| AMT_CREDIT_SUM | Current credit amount for the Credit Bureau credit | Continuous | Currency |
| AMT_CREDIT_SUM_DEBT | Current debt on Credit Bureau credit | Continuous | Currency |
| AMT_CREDIT_SUM_LIMIT | Current credit limit of credit card reported in Credit Bureau | Continuous | Currency |
| AMT_CREDIT_SUM_OVERDUE | Current amount overdue on Credit Bureau credit | Continuous | Currency |
Note: All the resolutions to the data quality issues described in this section are implemented in cleaning.py,
which is used to load the cleaned dataset.
In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let's get a sense of how many missing values are in each column.
While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem.
# create missing values table with all rows
pd.set_option('display.max_rows', 129)
missing_values_table(df)
# return row display setting to default
pd.set_option('display.max_rows', pd_max_rows_default)
Each feature or set of features will be discussed in descending order of the rate of missing values.
Building Features with Missing Values
A large number of the features with missing values are normalized information about the building where the client lives (such as ENTRANCES_MODE or COMMONAREA_AVG).
It is plausible that only some of these features exist in a given client's building.
It is also plausible that some of the records are missing (not provided by the client).
There does not appear to be an indication if values are missing or not applicable to the client.
All of these values have a missing rate above 40%.
Missing values in OWN_CAR_AGE
This feature refers to the age of cars owned by the client.
Approximately 66.0% of the values are empty.
However, there is also a feature FLAG_OWN_CAR, which indicates that the client owns a car.
It is reasonable to expect that values will be missing if the client does not own a car.
For clients reporting owning a car, all but 5 car age values are present.
For clients reporting not owning a car, no car ages are reported.
Since the missing rate is actually very small, these missing values could be imputed.
# get indexes of OWN_CAR_AGE that are not NA
car_age_indexes = df[df.OWN_CAR_AGE.notnull()].index
# get indexes of FLAG_OWN_CAR with value equal to Y (client owns car)
owns_car_indexes = df.query('FLAG_OWN_CAR == "Y"').index
print('There are {} records with ages for cars'.format(car_age_indexes.shape[0]))
print('There are {} records indicating that the client owns a car'.format(owns_car_indexes.shape[0]))
# get the number of non-null car age values on records where client does not list a car
car_age_not_own_car = df.query('FLAG_OWN_CAR == "N"').OWN_CAR_AGE.notnull().sum()
print('There are {} car ages reported for clients that report NOT owning a car'.format(car_age_not_own_car))
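Since only a handful of owner records lack a car age, a median fill restricted to car owners would resolve them. A minimal sketch of that imputation on a tiny synthetic frame (`toy` is illustrative and stands in for the real application data):

```python
import numpy as np
import pandas as pd

# tiny synthetic frame standing in for the real application data
toy = pd.DataFrame({
    'FLAG_OWN_CAR': ['Y', 'Y', 'Y', 'N', 'N'],
    'OWN_CAR_AGE':  [5.0, np.nan, 11.0, np.nan, np.nan],
})

# impute missing car ages only for clients who report owning a car
owners = toy['FLAG_OWN_CAR'] == 'Y'
median_age = toy.loc[owners, 'OWN_CAR_AGE'].median()
toy.loc[owners, 'OWN_CAR_AGE'] = toy.loc[owners, 'OWN_CAR_AGE'].fillna(median_age)
```

Non-owners are deliberately left as NA, since a car age is not applicable to them.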
Missing values in EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3
Since there are zeros in these columns,
we expect that the missing values represent the lack of an external source for these clients.
Therefore, we will add an encoding feature EXT_SOURCE_<number>_AV
for each of the external source features that represents the presence
of external information.
These original columns will only be used as an interaction with these
encoding features and the missing values will be filled with zero.
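A minimal sketch of this encoding for one of the columns, on a tiny synthetic frame (`toy` and the single column are illustrative; the actual handling lives in cleaning.py):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'EXT_SOURCE_1': [0.5, np.nan, 0.0]})

# flag whether the external score is present, then fill missing scores with zero
toy['EXT_SOURCE_1_AV'] = toy['EXT_SOURCE_1'].notnull().astype(int)
toy['EXT_SOURCE_1'] = toy['EXT_SOURCE_1'].fillna(0)
```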
Missing values in OCCUPATION_TYPE
There does not appear to be an indication that OCCUPATION_TYPE is systematically missing.
We will assume that the client did not provide this information and impute with a new categorical level Unknown.
OCCUPATION_TYPE has a missing rate of 31.3%.
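A sketch of this imputation on a tiny synthetic frame (illustrative only; the real step is implemented in cleaning.py):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'OCCUPATION_TYPE': ['Laborers', np.nan, 'Core staff']})

# impute missing occupations with a new categorical level
toy['OCCUPATION_TYPE'] = toy['OCCUPATION_TYPE'].fillna('Unknown')
```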
Missing values in ORGANIZATION_TYPE
Later in the outliers section, it is shown that the NAs in ORGANIZATION_TYPE are associated with occupations listed as 'Unemployed' or 'Pensioners'.
We will take this to mean that these clients are not associated with any employer and will impute these NAs with 'None'.
ORGANIZATION_TYPE has a missing rate of 18%.
Missing values for AMT_REQ_CREDIT_BUREAU Features
These features indicate the number of credit enquiries on the client in a
given time period: hour, day, week, month, quarter, and year.
There does not appear to be a systematic reason for the missing values in
these features.
We will treat these records as having no credit enquiries and impute with zero.
These features have a missing rate of 13.5%.
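A sketch of the zero-fill across the enquiry columns, shown on a tiny synthetic frame with just two of the six columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'AMT_REQ_CREDIT_BUREAU_HOUR': [0.0, np.nan, 1.0],
    'AMT_REQ_CREDIT_BUREAU_YEAR': [np.nan, 2.0, np.nan],
})

# select every enquiry-count column by prefix and fill NAs with zero
req_cols = [c for c in toy.columns if c.startswith('AMT_REQ_CREDIT_BUREAU')]
toy[req_cols] = toy[req_cols].fillna(0)
```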
Remaining Features with Low Missing Rate
The remaining features with missing values have missing rates below 0.5%.
Imputation on these features should have only a small impact on the total dataset.
We will use standard imputation strategies for these features:
imputation of the mode for categorical features and imputation of the median for continuous features.
We impute continuous features with the median rather than the mean because the continuous features are skewed and
the median is not impacted by large values in the features.
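A minimal sketch of this mode/median strategy on a tiny synthetic frame (column names reused from the application data for illustration; the real step is in cleaning.py):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'NAME_TYPE_SUITE': ['Unaccompanied', 'Family', np.nan, 'Unaccompanied'],
    'AMT_ANNUITY':     [10000.0, np.nan, 30000.0, 20000.0],
})

for col in toy.columns:
    if toy[col].dtype == object:
        # categorical: impute the mode
        toy[col] = toy[col].fillna(toy[col].mode()[0])
    else:
        # continuous: impute the median
        toy[col] = toy[col].fillna(toy[col].median())
```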
pd.set_option('display.max_rows', 122)
data = read_clean_data()
missing_values_table(data)
# load the bureau dataset if not loaded
if 'bureau' not in globals():
bureau = pd.read_csv('./bureau.csv', na_values = ['XNA'])
#engineering features from bureau dataset
newFeatures = create_newFeatures(bureau)
newFeatures.head()
newFeatures = newFeatures.fillna(0)
# merge datasets on loan ID
data = data.merge(newFeatures, on = 'SK_ID_CURR', how = 'left')
# fill all NAs with 0 in the new columns
fn_columns = list(newFeatures.columns)
for name in fn_columns:
data[name] = data[name].fillna(0)
missing_values_table(data)
At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:
https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
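Per the NIST definition, extreme outliers fall below Q1 - 3*IQR or above Q3 + 3*IQR. A small sketch of computing those fences (the sample values are made up for illustration):

```python
import numpy as np

# made-up sample with one obviously extreme value
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])

# quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# extreme-outlier fences per the NIST handbook: 3 * IQR beyond the quartiles
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
extreme = values[(values < lower) | (values > upper)]
```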
DAYS_EMPLOYED
There are a large number of entries for DAYS_EMPLOYED outside the main distribution.
These entries are at value 365243 and there are 55374 instances.
df.DAYS_EMPLOYED.hist(bins = 50);
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.shape
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.head()
These high values for DAYS_EMPLOYED appear to be associated with clients that are "Pensioner" or "Unemployed" and do not list an employment organization (ORGANIZATION_TYPE).
# get the index of instances with NAME_INCOME_TYPE either Pensioner or Unemployed
income_type_index = df.query('NAME_INCOME_TYPE == "Pensioner" | NAME_INCOME_TYPE == "Unemployed"').index
# get the index of records with NA for ORGANIZATION_TYPE
filtered_index = df[df.ORGANIZATION_TYPE.isna()].index
# get the indexes of the high values
high_val_index = df.query("DAYS_EMPLOYED >= 100000").index
# check that the indexes are the same (Index.equals also handles differing lengths)
if filtered_index.equals(high_val_index):
    print('Indexes of queried values are the same.')
else:
    print('Indexes of queried values are different.')
Possible Solution
A possible solution to this data quality issue would be to add a None level to ORGANIZATION_TYPE and only use DAYS_EMPLOYED as an interaction with ORGANIZATION_TYPE.
All the cleaning discussed in the sections above is implemented in cleaning.py.
This script contains a function (read_clean_data) that applies the cleaning steps and returns the cleaned dataset for work.
Details
Sample Output
data = read_clean_data()
data.head(2)
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
# Generate the pandas_profile report
#profile = ProfileReport(data, )
profile = ProfileReport(data, minimal=True)
#Create HTML Frame in notebook
profile.to_notebook_iframe()
This part of the exploration focuses on the use of box plots and histograms for visualizing continuous variables and bar charts for visualizing categorical variables. Variables we expect to be important to the analysis were selected for univariate visualization.
fig, axes = plt.subplots(2,2, figsize=(10, 10))
# histogram of total income
axes[0,0].hist(np.log10(data.AMT_INCOME_TOTAL), bins = 30)
axes[0,0].set_xlabel('log10 of Total Income')
axes[0,0].set_title('Histogram of Total Income (Log-Transformed)');
# boxplot of total income
axes[1,0].boxplot(np.log10(data.AMT_INCOME_TOTAL), vert = False)
axes[1,0].set_xlabel('log10 of Total Income')
axes[1,0].set_title('Boxplot of Total Income (Log-Transformed)');
# histogram of annuity
axes[0,1].hist((data.AMT_ANNUITY), bins = 30)
axes[0,1].set_xlabel('Amount Annuity')
axes[0,1].set_title('Histogram of Amount Annuity');
# boxplot of annuity
axes[1,1].boxplot((data.AMT_ANNUITY), vert = False)
axes[1,1].set_xlabel('Amount Annuity')
axes[1,1].set_title('Boxplot of Amount Annuity');
Financial Features
We expect that financial features will be important to the analysis. Features such as AMT_INCOME_TOTAL should be relevant to a client's ability to pay back a loan. Additionally, a client's income, especially very large or very small values, may influence a bank's willingness to provide credit. However, clients with extremely large incomes may receive special treatment due to their potential value.
Due to the skewed nature of these distributions, a histogram and a boxplot are shown for each feature. The histogram shows the shape of the main distribution, while the boxplots show the behavior of the tails and extreme values.
AMT_INCOME_TOTAL
For ease of visualization, AMT_INCOME_TOTAL was transformed with log base-10. This is due to the influence of some very large observations on histogram binning. As shown in the pair of plots for AMT_INCOME_TOTAL, most of the values are clustered just above 100,000 (10^5). The main section of the log-transformed distribution appears to be relatively close to normal (the distribution is right-skewed on the original scale). Several extreme values are shown in the box plot. The most extreme value is approximately 1000 times larger than the median and mean of the distribution.
AMT_ANNUITY
Like AMT_INCOME_TOTAL, AMT_ANNUITY has a right-skewed distribution. However, this feature could be plotted on the original scale because its range is not as large as that of AMT_INCOME_TOTAL. The histogram shows that most of the values are between 0 and 50,000. The boxplot shows that there are some extreme values outside the main distribution. The largest extreme value is approximately 10 times the median value.
For the financial features we expect that some kind of transformation may be necessary to reduce the influence of the extreme values. We may also consider removing very large outliers as they may be treated differently by the bank. For instance, clients with very large incomes may work personally with a bank representative rather than interacting with an automated system.
fig, axes = plt.subplots(1,2, figsize=(10, 5))
# histogram of age
axes[0].hist(data.DAYS_BIRTH / 365, bins = 35)
axes[0].set_xlabel('Client Age (Years)')
axes[0].set_title('Histogram of Client Age');
# histogram of employment tenure
# filter out the clients that do not appear to be employed
employed = data.query('NAME_INCOME_TYPE != "Pensioner" & NAME_INCOME_TYPE != "Unemployed"')
axes[1].hist((employed.DAYS_EMPLOYED / 365), bins = 30)
axes[1].set_xlabel('Tenure with Current Employer (Years)')
axes[1].set_title('Histogram of Client Tenure with Employer');
DAYS_BIRTH
We expect that client age may be an important factor in determining a client's ability to pay back a loan. Higher ages may represent more time to accumulate wealth and more experience dealing with financial matters. The ages are transformed into years (the original scale is days) because years are a more interpretable unit when the values are large. Client ages appear to be almost uniformly distributed. The majority of clients are around 25 to 65 years of age.
DAYS_EMPLOYED
We expect that tenure may be useful in predicting loan repayment ability. In some cases, individuals that have stayed with an employer for a long period of time are compensated well. Since the values are large, tenure with the most recent employer is plotted in years rather than days. Tenure with current employer is very right-skewed. There is a steep decrease in client tenure from 0 years to approximately 15 years. There appears to be an inflection point around 15 years of tenure, after which the rate of decrease slows dramatically. Clients listed as pensioners or unemployed were filtered out of this distribution.
axes = count_values_table(data.NAME_HOUSING_TYPE).plot.bar()
axes.set_xlabel('Housing Type')
axes.set_ylabel('Number of Clients')
axes.set_title('Number of Clients By Type of Housing');
NAME_HOUSING_TYPE
Almost all clients reported living in a house or apartment. Only a relatively small portion of clients reported other types of living situations. The skewness of this categorical variable could be an issue because of the bias toward a single response.
axes = count_values_table(data.OCCUPATION_TYPE).plot.bar();
axes.set_xlabel('Name of Occupation Type')
axes.set_ylabel('Number of Clients')
axes.set_title('Number of Clients By Occupation Type');
OCCUPATION_TYPE
We expect that the client's type of occupation may affect a client's ability to repay loans. The bar chart shows the counts of occupation types reported by clients. We see that many occupation types were reported. This feature has high cardinality with low counts in about two thirds of the categories. Oddly, a large number of clients did not report an occupation type; these are listed as "Unknown." It is possible that too many clients reported no occupation type for this feature to be useful.
To begin the process of exploring the relationships between continuous variables, we rely on two strategies: scatterplot matrices and component analysis.
A limitation of scatterplot matrices is that they can only visualize a limited number of variables at once, so a deeper mathematical exploration in the form of Component Analysis will be performed as well.
To analyze the distribution of the target variable across the categorical variables, we rely primarily on crosstabs, visualized as bar charts and heatmaps.
Additionally, we will view the relationship between the continuous variables and the target variable scatterplot matrices colored by group membership, where the groups are the target variable. This visualization method will help identify variables that provide the greatest discrimination between the two levels of TARGET.
Scatterplot Matrices of Continuous Variables
import matplotlib.pyplot as plt
#from mlxtend.plotting import scatterplotmatrix
cols = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_GOODS_PRICE','AMT_ANNUITY', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE']
g = sns.pairplot(data[cols])
#g.map_lower(sns.kdeplot, cmap="Blues_d") # use joint kde on the lower triangle
#g.map_upper(plt.scatter) # scatter on the upper
#g.map_diag(sns.kdeplot, lw=3, legend = False) # kde histogram on the diagonal
g.fig.suptitle('Relationships: AMT_INCOME_TOTAL, AMT_CREDIT, AMT_GOODS_PRICE, AMT_ANNUITY', y= 1.08)
plt.tight_layout()
plt.show()
The scatterplot matrix revealed the impact of significant outliers in the AMT_INCOME_TOTAL feature. The most significant outlier is more than ten times that of the next greatest income.
A collinear relationship between AMT_GOODS_PRICE and AMT_CREDIT also exists, representing the fact that cash credit is awarded on the basis of the value of the goods purchased. Also, the AMT_CREDIT, AMT_ANNUITY, and AMT_GOODS_PRICE variables appear to have almost identical relationships. The AMT_GOODS_PRICE feature will be considered for elimination, but kept in the training set until the component analysis is complete.
To observe the behavior of applicants based on income, we will bin them into quantiles using the AMT_INCOME_TOTAL feature and plot them against TARGET.
#create quantiles
CAT_INCOME = pd.qcut(data['AMT_INCOME_TOTAL'], q = 4)
print(CAT_INCOME.head())
We chose to visualize the distribution of loan defaults across income
quantiles with a heatmap because it will quickly allow us to assess the
distribution of the quantiles and identify the quantiles where the
majority of defaults exist in. This knowledge will help determine
whether removing the high income outliers from consideration will be
detrimental to the model's ability to discriminate between the two
levels of the TARGET variable.
default_income = list(zip(data.TARGET.astype(bool), CAT_INCOME))
def_by_inc = pd.DataFrame(default_income, columns = ['TARGET', 'INCOME_CAT'])
def_by_inc = pd.crosstab(index = def_by_inc.TARGET, columns = def_by_inc.INCOME_CAT).apply(lambda x: (x/x.sum())*100, axis = 1)
def_by_inc.head()
p = sns.heatmap(def_by_inc, vmin = 15, vmax = 40, annot = True, linewidths = .5, cmap = 'coolwarm',
annot_kws={'fontsize': 10,
'verticalalignment':'center'})
p.set(title = 'Defaults by Income Category',
ylabel = 'Loan Status - True = Default',
xlabel = 'Income Ranges')
A heatmap of loan defaults by income brackets indicates that a majority of loan defaults occur from the lower end of the income spectrum, within the range of 25,649 and 112,500. This observation points to a modest impact on a model where the highest income earners are removed from consideration.
A heatmap is chosen to visualize the correlation matrix of the large scale continuous variables because of the ease of determining multicollinearity among the predictor variables.
# creating correlation matrix
corr = data[cols].corr()
# generate mask for the upper triangle (np.bool is removed in newer NumPy; use the builtin bool)
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
# Set up matplotlib figure
f, ax = plt.subplots(figsize=(11,9))
# Draw heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap = 'coolwarm', center = 0.5,
linewidth = .5, cbar_kws={'shrink': .5},
square = True, annot = True)
The correlation plot reveals a 99% correlation between the AMT_GOODS_PRICE and AMT_CREDIT features. AMT_GOODS_PRICE will be removed from consideration because it is collinear with AMT_ANNUITY as well as AMT_CREDIT and would cause interference as we try to determine feature importance.
# drop the single most extreme income record so the remaining relationships are visible
sansOutlier = data[data.AMT_INCOME_TOTAL < data.AMT_INCOME_TOTAL.max()]
g = sns.pairplot(sansOutlier[cols])
g.fig.suptitle('Variables Without the Outlier', y= 1.08)
plt.tight_layout()
plt.show()
#changing the TARGET variable to categorical for easier visualization
data['TARGET'] = data.TARGET.astype('category')
import seaborn as sns
sns.set(style="ticks")
engFeatures = ['CREDIT_ACTIVE','CREDIT_INCOME_RATIO','ANNUITY_INCOME_RATIO','PERCENT_EMPLOYED_TO_AGE', 'TARGET']
sns.pairplot(data[engFeatures], hue = 'TARGET')
plt.show()
The kernel density plots reveal that applicants with fewer active accounts, total accounts, and less available credit tend to have a higher risk of defaulting on loans, as expected. The outlier default whose income was in excess of 117 million dollars is also clearly evident in most of the plots and would likely have a strong impact on a final model. This provides more evidence for the removal of the millionaire default outlier.
creditSums = ['AMT_CREDIT_SUM_OVERDUE', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM', 'TARGET']
sns.pairplot(data[creditSums], hue = 'TARGET')
plt.show()
When analyzing the distribution of the TARGET variable across the continuous credit features, the impact of the outlier is even more pronounced, making it impossible to identify the features critical to predicting defaults.
defaultsHouse_loan = pd.crosstab([data['NAME_HOUSING_TYPE'], data['NAME_CONTRACT_TYPE']],
data.TARGET.astype(bool))
defaultsHouse_loan.plot(kind = 'barh')
plt.suptitle("Defaults by Housing and Loan Type ")
plt.xlabel('Default Count')
plt.ylabel('Housing and Loan Types')
plt.show()
This barplot reveals that a vast majority of the loans in the dataset are cash loans offered to people who own their house or apartment.
defaultsActiveAcc = pd.crosstab([data.NAME_INCOME_TYPE, data.FLAG_OWN_CAR], data.TARGET.astype(bool))
defaultsActiveAcc.plot(kind = 'barh')
#print(data.columns)
flags = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
'FLAG_PHONE', 'FLAG_EMAIL', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
#flagData = data[flags]
#flag_defaults = flagData.groupby('TARGET').count()
#sns.heatmap(flag_defaults)
#flag_defaults
data[flags].groupby(data.TARGET).describe()
# frequency (as a percentage) of the target variable across all FLAG features
# melt the flags into long form so a single crosstab can cover every flag
flag_long = pd.melt(data, id_vars = ['TARGET'], value_vars = flags, var_name = 'FLAG')
realty_flag = pd.crosstab(index = flag_long.TARGET, columns = [flag_long.FLAG, flag_long.value]).apply(lambda x: (x/x.sum())*100, axis = 1)
realty_flag
The following features were engineered from the loan application features (from application_train.csv).
| Engineered Feature | Description | Type | Units | Formula |
|---|---|---|---|---|
| CREDIT_INCOME_RATIO | The ratio of the credit amount to the client's income | Numeric | N/A | AMT_CREDIT / AMT_INCOME_TOTAL |
| ANNUITY_INCOME_RATIO | The ratio of the loan annuity to the client's income | Numeric | N/A | AMT_ANNUITY / AMT_INCOME_TOTAL |
| PERCENT_EMPLOYED_TO_AGE | The ratio of the client's days employed to the client's age in days | Numeric | N/A | DAYS_EMPLOYED / DAYS_BIRTH |
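A sketch of how these three ratios could be computed (a one-row illustrative frame; the actual engineering is done alongside the other code in cleaning.py):

```python
import pandas as pd

# one illustrative applicant; DAYS_* are negative counts of days before application
toy = pd.DataFrame({
    'AMT_CREDIT':       [200000.0],
    'AMT_ANNUITY':      [20000.0],
    'AMT_INCOME_TOTAL': [100000.0],
    'DAYS_EMPLOYED':    [-2000.0],
    'DAYS_BIRTH':       [-10000.0],
})

toy['CREDIT_INCOME_RATIO'] = toy['AMT_CREDIT'] / toy['AMT_INCOME_TOTAL']
toy['ANNUITY_INCOME_RATIO'] = toy['AMT_ANNUITY'] / toy['AMT_INCOME_TOTAL']
toy['PERCENT_EMPLOYED_TO_AGE'] = toy['DAYS_EMPLOYED'] / toy['DAYS_BIRTH']
```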
The following features were engineered from the bureau features (from bureau.csv).
| Engineered Feature | Description | Type | Units | Formula |
|---|---|---|---|---|
| LOAN_COUNT | The total number of accounts, active and closed. | Numeric | N/A | length of CREDIT_ACTIVE, grouped by loan ID |
| CREDIT_ACTIVE | A count of active credit accounts by loan ID | Numeric | N/A | count of CREDIT_ACTIVE == 'Active', grouped by loan ID |
| CREDIT_DAY_OVERDUE | A count of days overdue for active credit accounts by loan ID | Numeric | N/A | sum of CREDIT_DAY_OVERDUE for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM | Total credit available from active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM_DEBT | Total debt of active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM_DEBT for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM_LIMIT | Overall credit limit of active accounts | Numeric | N/A | sum of AMT_CREDIT_SUM_LIMIT for CREDIT_ACTIVE == Active, grouped by loan ID |
| AMT_CREDIT_SUM_OVERDUE | Total amount overdue | Numeric | N/A | sum of AMT_CREDIT_SUM_OVERDUE for CREDIT_ACTIVE == Active, grouped by loan ID |
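A sketch of how these bureau aggregates might be produced (the real implementation is create_newFeatures in cleaning.py; `bureau_toy` and the subset of aggregates shown here are illustrative):

```python
import pandas as pd

# tiny stand-in for bureau.csv: two loans, mixed active/closed accounts
bureau_toy = pd.DataFrame({
    'SK_ID_CURR':     [1, 1, 1, 2],
    'CREDIT_ACTIVE':  ['Active', 'Closed', 'Active', 'Active'],
    'AMT_CREDIT_SUM': [1000.0, 500.0, 2000.0, 300.0],
})

# LOAN_COUNT: all accounts (active and closed) per loan ID
loan_count = bureau_toy.groupby('SK_ID_CURR').size().rename('LOAN_COUNT')

# remaining aggregates are computed over active accounts only
active = bureau_toy[bureau_toy['CREDIT_ACTIVE'] == 'Active']
active_agg = active.groupby('SK_ID_CURR').agg(
    CREDIT_ACTIVE=('CREDIT_ACTIVE', 'size'),     # count of active accounts
    AMT_CREDIT_SUM=('AMT_CREDIT_SUM', 'sum'),    # total active credit
)

new_features = pd.concat([loan_count, active_agg], axis=1).reset_index()
```

The same groupby-and-sum pattern extends to the overdue, debt, and limit columns listed in the table above.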